K Means Clustering Project - Solutions

Usually, when dealing with an unsupervised learning problem, it's difficult to get a good measure of how well the model performed. For this project, we will use red and white wine data from the UCI archive (a very commonly used data set in ML).

We will label each wine as red or white, set that label aside during clustering, and bring it back later to see how well the clusters match the two groups.

Get the Data

Download the two data csv files from the UCI repository (or just use the downloaded csv files).

Use read.csv to open both data sets and assign them to df1 and df2. Pay attention to what the separator (sep) is.

In [10]:
df1 <- read.csv('winequality-red.csv',sep=';')
df2 <- read.csv('winequality-white.csv',sep=';')

Now add a label column to both df1 and df2 indicating a label 'red' or 'white'.

In [11]:
# Lots of ways to do this

# Using sapply with anon functions
df1$label <- sapply(df1$pH,function(x){'red'})
df2$label <- sapply(df2$pH,function(x){'white'})
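
As the comment says, there are several ways to add the label. Because R recycles a single value across every row of a data frame, direct assignment should give the same result as the sapply version. Here is a minimal sketch on two small made-up data frames (toy.red and toy.white are hypothetical stand-ins for df1 and df2, not the real data):

```r
# Hypothetical stand-ins for df1 and df2 (values made up for illustration)
toy.red   <- data.frame(pH = c(3.51, 3.20, 3.26))
toy.white <- data.frame(pH = c(3.00, 3.30))

# A single string is recycled to every row
toy.red$label   <- 'red'
toy.white$label <- 'white'

# Compare against the sapply approach used above
via.sapply <- sapply(toy.red$pH, function(x){'red'})
print(identical(toy.red$label, via.sapply))
```

Direct assignment is shorter and does not depend on picking an arbitrary column such as pH to iterate over.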

Check the head of df1 and df2.

In [12]:
head(df1)
Out[12]:
  fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1           7.4              0.7           0            1.9     0.076
2           7.8             0.88           0            2.6     0.098
3           7.8             0.76        0.04            2.3     0.092
4          11.2             0.28        0.56            1.9     0.075
5           7.4              0.7           0            1.9     0.076
6           7.4             0.66           0            1.8     0.075
  free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
1                  11                   34  0.9978 3.51      0.56
2                  25                   67  0.9968  3.2      0.68
3                  15                   54   0.997 3.26      0.65
4                  17                   60   0.998 3.16      0.58
5                  11                   34  0.9978 3.51      0.56
6                  13                   40  0.9978 3.51      0.56
  alcohol quality label
1     9.4       5   red
2     9.8       5   red
3     9.8       5   red
4     9.8       6   red
5     9.4       5   red
6     9.4       5   red
In [13]:
head(df2)
Out[13]:
  fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1             7             0.27        0.36           20.7     0.045
2           6.3              0.3        0.34            1.6     0.049
3           8.1             0.28         0.4            6.9      0.05
4           7.2             0.23        0.32            8.5     0.058
5           7.2             0.23        0.32            8.5     0.058
6           8.1             0.28         0.4            6.9      0.05
  free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
1                  45                  170   1.001    3      0.45
2                  14                  132   0.994  3.3      0.49
3                  30                   97  0.9951 3.26      0.44
4                  47                  186  0.9956 3.19       0.4
5                  47                  186  0.9956 3.19       0.4
6                  30                   97  0.9951 3.26      0.44
  alcohol quality label
1     8.8       6 white
2     9.5       6 white
3    10.1       6 white
4     9.9       6 white
5     9.9       6 white
6    10.1       6 white

Combine df1 and df2 into a single data frame called wine.

In [14]:
wine <- rbind(df1,df2)
In [15]:
str(wine)
'data.frame':	6497 obs. of  13 variables:
 $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
 $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
 $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
 $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
 $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
 $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
 $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
 $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
 $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
 $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
 $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
 $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
 $ label               : chr  "red" "red" "red" "red" ...
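
rbind stacks the rows of both data frames and matches columns by name, so the two inputs need the same column names. A quick sanity check after combining is to confirm the row count and the number of rows per label. The sketch below uses hypothetical miniature data frames (toy.red, toy.white) rather than the real files:

```r
# Hypothetical miniature versions of df1 and df2 (values made up)
toy.red   <- data.frame(alcohol = c(9.4, 9.8), label = 'red')
toy.white <- data.frame(alcohol = c(8.8, 9.5, 10.1), label = 'white')

toy.wine <- rbind(toy.red, toy.white)

# Row count should equal the sum of the two inputs
stopifnot(nrow(toy.wine) == nrow(toy.red) + nrow(toy.white))

# Count rows per label to confirm nothing was dropped or mislabeled
print(table(toy.wine$label))
```

The same two checks can be run on wine itself, where the 6497 observations reported by str should split into the red and white row counts.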

EDA

Let's explore the data a bit and practice our ggplot2 skills!

Create a Histogram of residual sugar from the wine data. Color by red and white wines.

In [16]:
library(ggplot2)
In [37]:
pl <- ggplot(wine,aes(x=residual.sugar)) + geom_histogram(aes(fill=label),color='black',bins=50)
# Optional adding of fill colors
pl + scale_fill_manual(values = c('#ae4554','#faf7ea')) + theme_bw()

Create a Histogram of citric.acid from the wine data. Color by red and white wines.

In [39]:
pl <- ggplot(wine,aes(x=citric.acid)) + geom_histogram(aes(fill=label),color='black',bins=50)
# Optional adding of fill colors
pl + scale_fill_manual(values = c('#ae4554','#faf7ea')) + theme_bw()

Create a Histogram of alcohol from the wine data. Color by red and white wines.

In [40]:
pl <- ggplot(wine,aes(x=alcohol)) + geom_histogram(aes(fill=label),color='black',bins=50)
# Optional adding of fill colors
pl + scale_fill_manual(values = c('#ae4554','#faf7ea')) + theme_bw()

Create a scatterplot of residual.sugar versus citric.acid, color by red and white wine.

In [49]:
pl <- ggplot(wine,aes(x=citric.acid,y=residual.sugar)) + geom_point(aes(color=label),alpha=0.2)
# Optional adding of point colors
pl + scale_color_manual(values = c('#ae4554','#faf7ea')) +theme_dark()

Create a scatterplot of residual.sugar versus volatile.acidity, color by red and white wine.

In [52]:
pl <- ggplot(wine,aes(x=volatile.acidity,y=residual.sugar)) + geom_point(aes(color=label),alpha=0.2)
# Optional adding of point colors
pl + scale_color_manual(values = c('#ae4554','#faf7ea')) +theme_dark()

Feel free to explore the data as you see fit. We'll go ahead and move on!

Grab the wine data without the label and call it clus.data

In [65]:
clus.data <- wine[,1:12]
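
Selecting columns 1 to 12 works because label happens to be the last column. An alternative that does not depend on column position is to drop the label by name, then confirm that everything left is numeric, since kmeans needs numeric input. This is a sketch on a made-up data frame (toy.wine and toy.clus are hypothetical names):

```r
# Hypothetical miniature version of the combined wine data
toy.wine <- data.frame(pH      = c(3.51, 3.00),
                       alcohol = c(9.4, 8.8),
                       label   = c('red', 'white'))

# Keep every column except the one named 'label'
toy.clus <- toy.wine[, names(toy.wine) != 'label']
print(names(toy.clus))

# Every remaining column should be numeric before clustering
stopifnot(all(sapply(toy.clus, is.numeric)))
```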

Check the head of clus.data

In [63]:
head(clus.data)
Out[63]:
  fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1           7.4              0.7           0            1.9     0.076
2           7.8             0.88           0            2.6     0.098
3           7.8             0.76        0.04            2.3     0.092
4          11.2             0.28        0.56            1.9     0.075
5           7.4              0.7           0            1.9     0.076
6           7.4             0.66           0            1.8     0.075
  free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
1                  11                   34  0.9978 3.51      0.56
2                  25                   67  0.9968  3.2      0.68
3                  15                   54   0.997 3.26      0.65
4                  17                   60   0.998 3.16      0.58
5                  11                   34  0.9978 3.51      0.56
6                  13                   40  0.9978 3.51      0.56
  alcohol quality
1     9.4       5
2     9.8       5
3     9.8       5
4     9.8       6
5     9.4       5
6     9.4       5

Building the Clusters

Call the kmeans function on clus.data and assign the results to wine.cluster.

In [74]:
# clus.data is wine[,1:12]; kmeans starts from random centers, so results can vary between runs
wine.cluster <- kmeans(clus.data,2)
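
Because kmeans picks its starting centers at random, the cluster numbering (and sometimes the assignments) can change from run to run. One way to make a run repeatable is to call set.seed first and to ask for several random starts with the nstart argument. The sketch below uses made-up two-dimensional data, not the wine data, and the seed value 101 is arbitrary:

```r
set.seed(101)  # arbitrary seed, chosen only for this sketch

# Two well-separated made-up groups of 20 points each
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)

# nstart = 20 tries 20 random starts and keeps the best fit
toy.cluster <- kmeans(toy, centers = 2, nstart = 20)

print(toy.cluster$size)      # number of points in each cluster
print(toy.cluster$centers)   # one row of feature means per cluster
```

The returned object also carries cluster (the assignment for each row) and tot.withinss (the total within-cluster sum of squares).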

Print out the wine.cluster Cluster Means and explore the information.

In [76]:
print(wine.cluster$centers)
  fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1      7.619044        0.4079451   0.2911080       3.082690 0.0656846
2      6.904698        0.2871364   0.3398094       7.259286 0.0486092
  free.sulfur.dioxide total.sulfur.dioxide   density       pH sulphates
1            18.43735             63.54832 0.9945680 3.255147 0.5718655
2            39.82503            155.90101 0.9947956 3.190308 0.5000354
   alcohol  quality
1 10.79529 5.809204
2 10.25832 5.825436
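
One way to explore the cluster means is to ask which features differ most between the two centers. The sketch below copies the printed values into a matrix (named centers here for illustration) and ranks the absolute gaps. kmeans works with Euclidean distance on the raw values, so features measured on large numeric scales tend to carry most of the weight.

```r
# Cluster means copied from the printed output above
centers <- rbind(
  c(7.619044, 0.4079451, 0.2911080, 3.082690, 0.0656846, 18.43735,
    63.54832, 0.9945680, 3.255147, 0.5718655, 10.79529, 5.809204),
  c(6.904698, 0.2871364, 0.3398094, 7.259286, 0.0486092, 39.82503,
    155.90101, 0.9947956, 3.190308, 0.5000354, 10.25832, 5.825436)
)
colnames(centers) <- c('fixed.acidity', 'volatile.acidity', 'citric.acid',
                       'residual.sugar', 'chlorides', 'free.sulfur.dioxide',
                       'total.sulfur.dioxide', 'density', 'pH', 'sulphates',
                       'alcohol', 'quality')

# Absolute gap between the two cluster means, largest first
gap <- sort(abs(centers[1, ] - centers[2, ]), decreasing = TRUE)
print(round(head(gap, 3), 2))
```

The two sulfur dioxide columns and residual.sugar show the largest gaps, while quality is almost identical across the clusters. Rescaling the columns with scale() before clustering is one common option when a few large-valued features dominate; this project does not do that.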

Evaluating the Clusters

You usually won't have the luxury of labeled data with K-Means, but let's go ahead and see how we did!

Use the table() function to compare your cluster results to the real results. Which is easier to correctly group, red or white wines?

In [85]:
table(wine$label,wine.cluster$cluster)
Out[85]:
       
           1    2
  red   1515   84
  white 1310 3588
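
Raw counts are hard to compare because there are about three times as many white wines as red ones. Converting each row to proportions with prop.table makes the comparison direct. The sketch below rebuilds the table from the counts printed above (tab is a name introduced here for illustration):

```r
# Counts copied from the table above (rows: true label, columns: cluster)
tab <- matrix(c(1515, 1310, 84, 3588), nrow = 2,
              dimnames = list(c('red', 'white'), c('1', '2')))

# Share of each wine type that landed in each cluster (each row sums to 1)
print(round(prop.table(tab, margin = 1), 3))
```

With the real objects, the same call would be prop.table(table(wine$label, wine.cluster$cluster), margin = 1).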

We can see that red wines are easier to group: 1515 of the 1599 red wines (about 95%) fall in cluster 1, while the white wines are split, with 3588 of 4898 (about 73%) in cluster 2 and the remaining 1310 in cluster 1. This is consistent with our earlier visualizations. The noise among white wines could be partly due to rosé wines being categorized as white while still retaining some qualities of a red wine. More generally, wine is essentially fermented grape juice, and the chemical measurements we were given may not correlate well with whether a wine is red or white!

It's important to note that K-Means can only give you the clusters. It can't directly tell you what the labels should be, or even how many clusters you should have; we are just lucky to know that we expected two types of wine. This is where domain knowledge really comes into play.
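
When the number of clusters is not known in advance, one common heuristic (not used in this project) is to fit K-Means for several values of k and look at how the total within-cluster sum of squares falls as k grows. The sketch below uses made-up data with three obvious groups; the seed, the data, and the range of k are all assumptions chosen for illustration:

```r
set.seed(101)  # arbitrary seed, chosen only for this sketch

# Three well-separated made-up groups of 30 points each
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1 to 6
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)
print(round(wss, 1))

# Uncomment to look for the "elbow" where the curve flattens
# plot(ks, wss, type = 'b', xlab = 'k', ylab = 'Total within-cluster SS')
```

On real data the bend is often much less clear than on a toy example, which is another reason domain knowledge matters.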

Great Job!